Exploiting Dataset Similarity for Distributed Mining

Authors

Srinivasan Parthasarathy

Mitsunori Ogihara

Abstract

The notion of similarity is an important one in data mining. It can be used to provide useful structural information on data as well as to enable clustering. In this paper we present an elegant method for measuring the similarity between homogeneous datasets. The algorithm presented is efficient in storage and scale, can adjust to time constraints, and can provide the user with likely causes of similarity or dissimilarity. One potential application of our similarity measure is in the distributed data mining domain. Using the notion of similarity across databases as a distance metric, one can generate clusters of similar datasets. Once similar datasets are clustered, each cluster can be independently mined to generate the appropriate rules for that cluster. The similarity measure is evaluated on a dataset from the Census Bureau and on synthetic datasets from IBM.
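
The abstract does not spell out the similarity measure itself, so the following is only a minimal sketch of the workflow it describes: summarize each dataset, compare the summaries pairwise, and group datasets whose similarity passes a threshold. The Jaccard overlap of frequent itemsets used here is a stand-in assumption, not the authors' measure, and every function name, parameter, and sample dataset is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Map each itemset (up to max_size items) to its support, keeping
    only itemsets whose support is at least min_support."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def similarity(summary_a, summary_b):
    """ASSUMED stand-in measure: Jaccard overlap of the two collections
    of frequent itemsets (1.0 = identical collections, 0.0 = disjoint)."""
    union = set(summary_a) | set(summary_b)
    if not union:
        return 1.0
    return len(set(summary_a) & set(summary_b)) / len(union)


def cluster(datasets, min_support=0.5, threshold=0.5):
    """Group dataset names into connected components, linking two datasets
    whenever their similarity is at least threshold."""
    names = list(datasets)
    summaries = {n: frequent_itemsets(datasets[n], min_support) for n in names}
    parent = {n: n for n in names}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similarity(summaries[a], summaries[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical sites, each holding a small transaction dataset.
sites = {
    "site_a": [["bread", "milk"], ["bread", "milk"], ["bread", "eggs"], ["milk"]],
    "site_b": [["bread", "milk"], ["bread", "milk", "eggs"], ["bread"], ["milk"]],
    "site_c": [["tea", "sugar"], ["tea", "sugar"], ["tea"], ["sugar", "lemon"]],
}
print(cluster(sites))
```

In this toy setup each resulting group could then be mined on its own, and the itemsets that appear in one summary but not the other give a rough hint at why two datasets differ.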

Similar resources

Improving Imbalanced data classification accuracy by using Fuzzy Similarity Measure and subtractive clustering

Classification is one of the important parts of data mining and knowledge discovery. In most cases, the data that is used to train the clusters is not well distributed. This inappropriate distribution occurs when one class has a large number of samples while the number of samples in the other class is inherently low. In general, the methods of solving this kind of prob...

Improve The Linear Regression Model in Bioinformatics Using Text Mining

Linear regression is a commonly used approach in bioinformatics. One of the main challenges in applying linear regression in bioinformatics is that the number of regression weights that need to be determined is often at least one order of magnitude larger than the number of data points available for training. This sparse data problem often reduces the reliability in determining regression weights,...

Distributed Privacy Preserving Data Mining: A framework for k-anonymity based on feature set partitioning approach of vertically fragmented databases

Recently, many data mining algorithms for discovering and exploiting patterns in data have been developed, and the amount of data about individuals that is collected and stored continues to increase rapidly. However, databases containing information about individuals may be sensitive, and data mining algorithms run on such data sets may violate individual privacy. Also most organizations collect and sh...

Instance reduction approach to machine learning and multi-database mining

The paper proposes a heuristic instance reduction algorithm as an approach to machine learning and knowledge discovery in centralized and distributed databases. The proposed algorithm is based on an original method for a selection of reference instances and creates a reduced training dataset. The reduced training set consisting of selected instances can be used as an input for the machine learn...

Rights Protection of Multidimensional Time-Series Datasets with Neighborhood Preservation

Companies frequently outsource datasets to mining firms, and academic institutions create repositories and share datasets in the interest of promoting research collaboration. Still, many practitioners feel reserved about sharing or outsourcing datasets, primarily because of the fear of losing the principal rights over the dataset. This work presents a way of convincingly claiming ...

Publication date: 2000
